The individual data includes personal and health characteristics of children in 12 communities across Southern California. The regional data include air quality measurements at the community level.
# reading the datasetschs_individual <-read.csv("/Users/neens/Downloads/chs_individual.csv")chs_regional <-read.csv("/Users/neens/Downloads/chs_regional.csv")# merging datasets based on the location variablemerged_data <-merge(chs_individual, chs_regional, by ="townname")# view the first few rows of the merged datasethead(merged_data, 5)
After merging the data, make sure you don’t have any duplicates by counting the number of rows. Make sure it matches.
# counting the number of rows in the individual and regional datasetsnrow_individual <-nrow(chs_individual)nrow_regional <-nrow(chs_regional)cat("Number of rows in the individual dataset: ", nrow_individual, "\n")
Number of rows in the individual dataset: 1200
cat("Number of rows in the regional dataset: ", nrow_regional, "\n")
Number of rows in the regional dataset: 12
# counting the number of rows in the merged datasetnrow_merged <-nrow(merged_data)cat("Number of rows in the merged dataset: ", nrow_merged, "\n")
Number of rows in the merged dataset: 1200
# checking for duplicates in the merged datasetduplicates <-sum(duplicated(merged_data))cat("Number of duplicate rows in the merged dataset: ", duplicates, "\n")
Number of duplicate rows in the merged dataset: 0
The individual and merged datasets both contain 1,200 rows, indicating that no duplicate observations are present in the merged dataset.
In the case of missing values, impute data using the average amongst individuals with the same values for the “male” and “hispanic” variables. For categorical variables, take the mode.
# function to calculate modeget_mode <-function(v) { uniqv <-unique(v[!is.na(v)]) # Exclude NA values uniqv[which.max(tabulate(match(v, uniqv)))]}# for numerical variables: Impute using group mean (based on male and hispanic)numerical_vars <-c("agepft", "height", "weight", "bmi", "fev", "fvc", "mmef", "pm25_mass", "pm25_so4", "pm25_no3", "pm25_nh4", "pm25_oc", "pm25_ec", "pm25_om", "pm10_oc", "pm10_ec", "pm10_tc", "formic", "acetic", "hcl", "hno3", "o3_max", "o3106", "o3_24", "no2", "pm10", "no_24hr", "pm2_5_fr", "iacid", "oacid", "total_acids", "lon", "lat")# impute missing numerical valuesmerged_data <- merged_data %>%group_by(male, hispanic) %>%mutate(across(all_of(numerical_vars), ~ifelse(is.na(.), mean(., na.rm =TRUE), .)))# for categorical variables: Impute using group mode (based on male and hispanic)categorical_vars <-c("townname", "race", "asthma", "active_asthma", "father_asthma", "mother_asthma", "wheeze", "hayfever", "allergy", "educ_parent", "smoke", "pets", "gasstove")# impute missing categorical valuesmerged_data <- merged_data %>%group_by(male, hispanic) %>%mutate(across(all_of(categorical_vars), ~ifelse(is.na(.), get_mode(.), .)))# ungroup after imputationmerged_data <-ungroup(merged_data)# check the updated data to make sure there are no missing valuessummary(merged_data)
townname sid male race
Length:1200 Min. : 1.0 Min. :0.0000 Length:1200
Class :character 1st Qu.: 528.8 1st Qu.:0.0000 Class :character
Mode :character Median :1041.5 Median :0.0000 Mode :character
Mean :1037.5 Mean :0.4917
3rd Qu.:1554.2 3rd Qu.:1.0000
Max. :2053.0 Max. :1.0000
hispanic agepft height weight
Min. :0.0000 Min. : 8.961 Min. :114 Min. : 42.00
1st Qu.:0.0000 1st Qu.: 9.632 1st Qu.:135 1st Qu.: 66.00
Median :0.0000 Median : 9.907 Median :139 Median : 76.00
Mean :0.4342 Mean : 9.923 Mean :139 Mean : 79.31
3rd Qu.:1.0000 3rd Qu.:10.155 3rd Qu.:143 3rd Qu.: 87.00
Max. :1.0000 Max. :12.731 Max. :165 Max. :207.00
bmi asthma active_asthma father_asthma
Min. :11.30 Min. :0.0000 Min. :0.00 Min. :0.00000
1st Qu.:15.96 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.00000
Median :17.81 Median :0.0000 Median :0.00 Median :0.00000
Mean :18.50 Mean :0.1425 Mean :0.19 Mean :0.07583
3rd Qu.:19.99 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.:0.00000
Max. :41.27 Max. :1.0000 Max. :1.00 Max. :1.00000
mother_asthma wheeze hayfever allergy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.0975 Mean :0.3117 Mean :0.1575 Mean :0.2775
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
educ_parent smoke pets gasstove
Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :3.000 Median :0.0000 Median :1.0000 Median :1.0000
Mean :2.808 Mean :0.1583 Mean :0.7667 Mean :0.7875
3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
fev fvc mmef pm25_mass
Min. : 984.8 Min. : 895 Min. : 757.6 Min. : 5.960
1st Qu.:1827.6 1st Qu.:2065 1st Qu.:2043.7 1st Qu.: 7.615
Median :2016.4 Median :2282 Median :2392.1 Median :10.545
Mean :2030.1 Mean :2322 Mean :2398.6 Mean :14.362
3rd Qu.:2223.6 3rd Qu.:2551 3rd Qu.:2737.1 3rd Qu.:20.988
Max. :3323.7 Max. :3698 Max. :4935.9 Max. :29.970
pm25_so4 pm25_no3 pm25_nh4 pm25_oc
Min. :0.790 Min. : 0.730 Min. :0.4100 Min. : 1.450
1st Qu.:1.077 1st Qu.: 1.538 1st Qu.:0.7375 1st Qu.: 2.520
Median :1.815 Median : 2.525 Median :1.1350 Median : 4.035
Mean :1.876 Mean : 4.488 Mean :1.7642 Mean : 4.551
3rd Qu.:2.605 3rd Qu.: 7.338 3rd Qu.:2.7725 3rd Qu.: 5.350
Max. :3.230 Max. :12.200 Max. :4.2500 Max. :11.830
pm25_ec pm25_om pm10_oc pm10_ec
Min. :0.1300 Min. : 1.740 Min. : 1.860 Min. :0.1400
1st Qu.:0.4000 1st Qu.: 3.020 1st Qu.: 3.228 1st Qu.:0.4100
Median :0.5850 Median : 4.840 Median : 5.170 Median :0.5950
Mean :0.7358 Mean : 5.460 Mean : 5.832 Mean :0.7525
3rd Qu.:1.1750 3rd Qu.: 6.418 3rd Qu.: 6.855 3rd Qu.:1.1975
Max. :1.3600 Max. :14.200 Max. :15.160 Max. :1.3900
pm10_tc formic acetic hcl
Min. : 1.990 Min. :0.340 Min. :0.750 Min. :0.2200
1st Qu.: 3.705 1st Qu.:0.720 1st Qu.:2.297 1st Qu.:0.3250
Median : 6.505 Median :1.105 Median :2.910 Median :0.4350
Mean : 6.784 Mean :1.332 Mean :3.010 Mean :0.4208
3rd Qu.: 8.430 3rd Qu.:1.765 3rd Qu.:4.000 3rd Qu.:0.4625
Max. :16.440 Max. :2.770 Max. :5.140 Max. :0.7300
hno3 o3_max o3106 o3_24
Min. :0.430 Min. :38.27 Min. :28.22 Min. :18.22
1st Qu.:1.593 1st Qu.:49.93 1st Qu.:41.90 1st Qu.:23.31
Median :2.455 Median :64.05 Median :46.74 Median :27.59
Mean :2.367 Mean :60.16 Mean :47.76 Mean :30.23
3rd Qu.:3.355 3rd Qu.:67.69 3rd Qu.:55.24 3rd Qu.:32.39
Max. :4.070 Max. :84.44 Max. :67.01 Max. :57.76
no2 pm10 no_24hr pm2_5_fr
Min. : 4.60 Min. :18.40 Min. : 2.050 Min. : 9.01
1st Qu.:12.12 1st Qu.:20.71 1st Qu.: 6.487 1st Qu.:13.47
Median :16.40 Median :29.64 Median :14.080 Median :19.94
Mean :18.99 Mean :32.64 Mean :16.218 Mean :19.80
3rd Qu.:23.24 3rd Qu.:39.16 3rd Qu.:20.585 3rd Qu.:25.93
Max. :37.97 Max. :70.39 Max. :42.950 Max. :31.55
iacid oacid total_acids lon
Min. :0.760 Min. :1.090 Min. : 1.520 Min. :-120.7
1st Qu.:1.835 1st Qu.:2.978 1st Qu.: 4.930 1st Qu.:-118.8
Median :2.825 Median :4.135 Median : 6.370 Median :-117.7
Mean :2.788 Mean :4.342 Mean : 6.708 Mean :-118.3
3rd Qu.:3.817 3rd Qu.:5.982 3rd Qu.: 9.395 3rd Qu.:-117.4
Max. :4.620 Max. :7.400 Max. :11.430 Max. :-116.8
lat
Min. :32.84
1st Qu.:33.93
Median :34.10
Mean :34.20
3rd Qu.:34.65
Max. :35.49
Create a new categorical variable named “obesity_level” using the BMI measurement (underweight BMI<14; normal BMI 14-22; overweight BMI 22-24; obese BMI>24).To make sure the variable is rightly coded, create a summary table that contains the minimum BMI, maximum BMI, and the total number of observations per category.
# creating the obesity_level variable based on BMImerged_data <- merged_data %>%mutate(obesity_level =case_when( bmi <14~"underweight", bmi >=14& bmi <=22~"normal", bmi >22& bmi <=24~"overweight", bmi >24~"obese" ))# checking if the obesity_level variable was created correctly# summary table with min, max, and count per categoryobesity_summary_table <- merged_data %>%group_by(obesity_level) %>%summarise(min_bmi =min(bmi, na.rm =TRUE),max_bmi =max(bmi, na.rm =TRUE),count =n() )# display the summary tablekable(obesity_summary_table)
obesity_level
min_bmi
max_bmi
count
normal
14.00380
21.96387
975
obese
24.00647
41.26613
103
overweight
22.02353
23.99650
87
underweight
11.29640
13.98601
35
There are 975 individuals in the normal category, 103 in the obese category, 87 in the overweight category, and 35 in the underweight category. Summed up, this is 1200, meaning that all observations were accounted for. The minimum and maximum of each obesity level also follow the cut-off values accordingly.
Create another categorical variable named “smoke_gas_exposure” that summarizes “Second Hand Smoke” and “Gas Stove.” The variable should have four categories in total.
# creating the smoke_gas_exposure variablemerged_data <- merged_data %>%mutate(smoke_gas_exposure =case_when( smoke ==0& gasstove ==0~"No exposure", smoke ==1& gasstove ==0~"Smoke exposure only", smoke ==0& gasstove ==1~"Gas stove exposure only", smoke ==1& gasstove ==1~"Both exposures" ))# summary table checking the counts in each category of smoke_gas_exposuresmoke_summary_table <- merged_data %>%group_by(smoke_gas_exposure) %>%summarise(count =n())# display the summary tablekable(smoke_summary_table)
smoke_gas_exposure
count
Both exposures
154
Gas stove exposure only
791
No exposure
219
Smoke exposure only
36
The dataset reveals that the majority of individuals (791) were exposed to gas stove emissions only, while a smaller group of 154 individuals experienced both smoke and gas stove exposure. Additionally, 219 individuals had no exposure to either, and the smallest group, comprising just 36 individuals, was exposed solely to smoke. These totals sum to 1,200, confirming that all observations are accounted for.
Create four summary tables showing the average (or proportion, if binary) and sd of “Forced expiratory volume in 1 second (ml)” (an asthma indicator) by town, sex, obesity level, and “smoke_gas_exposure.”
# function to generate summary tablegenerate_summary <-function(group_var) { merged_data %>%group_by(!!sym(group_var)) %>%summarise(mean_fev =mean(fev, na.rm =TRUE),sd_fev =sd(fev, na.rm =TRUE),n =n() )}# variables to group bygroup_vars <-c("townname", "male", "obesity_level", "smoke_gas_exposure")# loop through each grouping variable and generate summary tablessummary_tables <-lapply(group_vars, function(var) { summary_table <-generate_summary(var)# print the summary tablecat("\nSummary by", var, ":\n")print(kable(summary_table)) # Use print() to display the table in some contexts})
The primary questions of interest are: What is the association between BMI and FEV (forced expiratory volume)? What is the association between smoke and gas exposure and FEV? What is the association between PM2.5 exposure and FEV?
Follow the EDA checklist from week 3 and the previous assignment. Be sure to focus on the key variables.
# check the dimensions of the datadim(merged_data)
# get summary statistics for numeric variablessummary(merged_data)
townname sid male race
Length:1200 Min. : 1.0 Min. :0.0000 Length:1200
Class :character 1st Qu.: 528.8 1st Qu.:0.0000 Class :character
Mode :character Median :1041.5 Median :0.0000 Mode :character
Mean :1037.5 Mean :0.4917
3rd Qu.:1554.2 3rd Qu.:1.0000
Max. :2053.0 Max. :1.0000
hispanic agepft height weight
Min. :0.0000 Min. : 8.961 Min. :114 Min. : 42.00
1st Qu.:0.0000 1st Qu.: 9.632 1st Qu.:135 1st Qu.: 66.00
Median :0.0000 Median : 9.907 Median :139 Median : 76.00
Mean :0.4342 Mean : 9.923 Mean :139 Mean : 79.31
3rd Qu.:1.0000 3rd Qu.:10.155 3rd Qu.:143 3rd Qu.: 87.00
Max. :1.0000 Max. :12.731 Max. :165 Max. :207.00
bmi asthma active_asthma father_asthma
Min. :11.30 Min. :0.0000 Min. :0.00 Min. :0.00000
1st Qu.:15.96 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.00000
Median :17.81 Median :0.0000 Median :0.00 Median :0.00000
Mean :18.50 Mean :0.1425 Mean :0.19 Mean :0.07583
3rd Qu.:19.99 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.:0.00000
Max. :41.27 Max. :1.0000 Max. :1.00 Max. :1.00000
mother_asthma wheeze hayfever allergy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.0975 Mean :0.3117 Mean :0.1575 Mean :0.2775
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
educ_parent smoke pets gasstove
Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :3.000 Median :0.0000 Median :1.0000 Median :1.0000
Mean :2.808 Mean :0.1583 Mean :0.7667 Mean :0.7875
3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
fev fvc mmef pm25_mass
Min. : 984.8 Min. : 895 Min. : 757.6 Min. : 5.960
1st Qu.:1827.6 1st Qu.:2065 1st Qu.:2043.7 1st Qu.: 7.615
Median :2016.4 Median :2282 Median :2392.1 Median :10.545
Mean :2030.1 Mean :2322 Mean :2398.6 Mean :14.362
3rd Qu.:2223.6 3rd Qu.:2551 3rd Qu.:2737.1 3rd Qu.:20.988
Max. :3323.7 Max. :3698 Max. :4935.9 Max. :29.970
pm25_so4 pm25_no3 pm25_nh4 pm25_oc
Min. :0.790 Min. : 0.730 Min. :0.4100 Min. : 1.450
1st Qu.:1.077 1st Qu.: 1.538 1st Qu.:0.7375 1st Qu.: 2.520
Median :1.815 Median : 2.525 Median :1.1350 Median : 4.035
Mean :1.876 Mean : 4.488 Mean :1.7642 Mean : 4.551
3rd Qu.:2.605 3rd Qu.: 7.338 3rd Qu.:2.7725 3rd Qu.: 5.350
Max. :3.230 Max. :12.200 Max. :4.2500 Max. :11.830
pm25_ec pm25_om pm10_oc pm10_ec
Min. :0.1300 Min. : 1.740 Min. : 1.860 Min. :0.1400
1st Qu.:0.4000 1st Qu.: 3.020 1st Qu.: 3.228 1st Qu.:0.4100
Median :0.5850 Median : 4.840 Median : 5.170 Median :0.5950
Mean :0.7358 Mean : 5.460 Mean : 5.832 Mean :0.7525
3rd Qu.:1.1750 3rd Qu.: 6.418 3rd Qu.: 6.855 3rd Qu.:1.1975
Max. :1.3600 Max. :14.200 Max. :15.160 Max. :1.3900
pm10_tc formic acetic hcl
Min. : 1.990 Min. :0.340 Min. :0.750 Min. :0.2200
1st Qu.: 3.705 1st Qu.:0.720 1st Qu.:2.297 1st Qu.:0.3250
Median : 6.505 Median :1.105 Median :2.910 Median :0.4350
Mean : 6.784 Mean :1.332 Mean :3.010 Mean :0.4208
3rd Qu.: 8.430 3rd Qu.:1.765 3rd Qu.:4.000 3rd Qu.:0.4625
Max. :16.440 Max. :2.770 Max. :5.140 Max. :0.7300
hno3 o3_max o3106 o3_24
Min. :0.430 Min. :38.27 Min. :28.22 Min. :18.22
1st Qu.:1.593 1st Qu.:49.93 1st Qu.:41.90 1st Qu.:23.31
Median :2.455 Median :64.05 Median :46.74 Median :27.59
Mean :2.367 Mean :60.16 Mean :47.76 Mean :30.23
3rd Qu.:3.355 3rd Qu.:67.69 3rd Qu.:55.24 3rd Qu.:32.39
Max. :4.070 Max. :84.44 Max. :67.01 Max. :57.76
no2 pm10 no_24hr pm2_5_fr
Min. : 4.60 Min. :18.40 Min. : 2.050 Min. : 9.01
1st Qu.:12.12 1st Qu.:20.71 1st Qu.: 6.487 1st Qu.:13.47
Median :16.40 Median :29.64 Median :14.080 Median :19.94
Mean :18.99 Mean :32.64 Mean :16.218 Mean :19.80
3rd Qu.:23.24 3rd Qu.:39.16 3rd Qu.:20.585 3rd Qu.:25.93
Max. :37.97 Max. :70.39 Max. :42.950 Max. :31.55
iacid oacid total_acids lon
Min. :0.760 Min. :1.090 Min. : 1.520 Min. :-120.7
1st Qu.:1.835 1st Qu.:2.978 1st Qu.: 4.930 1st Qu.:-118.8
Median :2.825 Median :4.135 Median : 6.370 Median :-117.7
Mean :2.788 Mean :4.342 Mean : 6.708 Mean :-118.3
3rd Qu.:3.817 3rd Qu.:5.982 3rd Qu.: 9.395 3rd Qu.:-117.4
Max. :4.620 Max. :7.400 Max. :11.430 Max. :-116.8
lat obesity_level smoke_gas_exposure
Min. :32.84 Length:1200 Length:1200
1st Qu.:33.93 Class :character Class :character
Median :34.10 Mode :character Mode :character
Mean :34.20
3rd Qu.:34.65
Max. :35.49
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.960 7.615 10.545 14.362 20.988 29.970
From the summary tables, we can see that the imputed variables of obesity_level and smoke_gas_exposure have no NA values, signifying that the data was imputed correctly. Likewise, we can also see that the third variable of interest, pm25_mass, has no missing values.
Association Between BMI and FEV
# scatter plot of BMI vs. FEVggplot(merged_data, aes(x = bmi, y = fev)) +geom_point(alpha =0.7, color ="pink", size =3) +geom_smooth(method ="lm", color ="turquoise", fill ="lightblue", se =TRUE) +labs(title ="Scatter Plot of BMI vs. Forced Expiratory Volume (ml)",x ="BMI",y ="Forced Expiratory Volume (ml)" )
`geom_smooth()` using formula = 'y ~ x'
# calculate correlation between BMI and FEVcor_bmi_fev <-cor(merged_data$bmi, merged_data$fev, use ="complete.obs")print(paste("Correlation between BMI and FEV:", cor_bmi_fev))
[1] "Correlation between BMI and FEV: 0.357112030981777"
# linear regression modelmodel_bmi_fev <-lm(fev ~ bmi, data = merged_data)summary(model_bmi_fev)
Call:
lm(formula = fev ~ bmi, data = merged_data)
Residuals:
Min 1Q Median 3Q Max
-989.47 -193.88 -11.32 179.43 1305.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1452.46 44.49 32.65 <2e-16 ***
bmi 31.23 2.36 13.23 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 297.2 on 1198 degrees of freedom
Multiple R-squared: 0.1275, Adjusted R-squared: 0.1268
F-statistic: 175.1 on 1 and 1198 DF, p-value: < 2.2e-16
The correlation coefficient of 0.357 between BMI and FEV suggests a moderate positive linear relationship. This implies that higher BMI values are associated with increased FEV, but the strength of the association is not particularly robust. A correlation of this magnitude indicates that BMI accounts for some variation in FEV, but it is far from being the dominant predictor. The scatter plot shows this trend visually, with a slight upward slope in the fitted line, confirming the positive correlation. The R-squared value of 0.1275 means that about 12.75% of the variance in FEV is explained by BMI, indicating that while BMI is a significant predictor, it only accounts for a small portion of the variation in FEV. This result aligns with the correlation coefficient of 0.357, suggesting a moderate positive linear relationship as other factors not captured in this analysis may also play a significant role in determining FEV.
However, it’s essential to note that given this analysis is exploratory data analysis (EDA), we are primarily concerned with understanding associations and correlations rather than establishing statistical significance.
Association Between Smoke and Gas Exposure and FEV
# calculate mean FEV for each smoke/gas exposure groupmean_fev_data <- merged_data %>%group_by(smoke_gas_exposure) %>%summarize(mean_fev =mean(fev, na.rm =TRUE))# boxplot of Smoke and Gas Exposure vs FEVggplot(merged_data, aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) +geom_boxplot(alpha =0.5, outlier.color ="black") +labs(title ="Boxplot of Smoke and Gas Exposure vs. Forced Expiratory Volume (ml)",x ="Smoke and Gas Exposure",y ="Forced Expiratory Volume (ml)") +guides(fill =guide_legend(title ="Smoke/Gas Exposure")) +theme_minimal(base_size =10) +stat_summary(fun = mean, geom ="point", shape =20, size =3, color ="black", fill ="black") +geom_text(data = mean_fev_data, aes(x = smoke_gas_exposure, y = mean_fev, label =round(mean_fev, 1)),vjust =-0.75, color ="black", size =3)
# perform ANOVA to test for association between smoke and gas exposure and FEVanova_result <-aov(fev ~ smoke_gas_exposure, data = merged_data)summary(anova_result)
Df Sum Sq Mean Sq F value Pr(>F)
smoke_gas_exposure 3 226515 75505 0.746 0.525
Residuals 1196 121017980 101186
# linear regression modelmodel_smoke_fev <-lm(fev ~ smoke_gas_exposure, data = merged_data)summary(model_smoke_fev)
Call:
lm(formula = fev ~ smoke_gas_exposure, data = merged_data)
Residuals:
Min 1Q Median 3Q Max
-1037.8 -205.3 -12.0 190.7 1301.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2024.778 25.633 78.991 <2e-16
smoke_gas_exposureGas stove exposure only -2.107 28.017 -0.075 0.940
smoke_gas_exposureNo exposure 31.915 33.453 0.954 0.340
smoke_gas_exposureSmoke exposure only 30.936 58.888 0.525 0.599
(Intercept) ***
smoke_gas_exposureGas stove exposure only
smoke_gas_exposureNo exposure
smoke_gas_exposureSmoke exposure only
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 318.1 on 1196 degrees of freedom
Multiple R-squared: 0.001868, Adjusted R-squared: -0.0006354
F-statistic: 0.7462 on 3 and 1196 DF, p-value: 0.5246
The box plot visually compares FEV across four smoke and gas exposure categories: both exposures, gas stove only, no exposure, and smoke exposure only. The median FEV values are relatively similar across all groups, ranging from about 2022 ml to 2056 ml. The interquartile ranges (IQRs) and the overall distribution of FEV also appear similar, indicating minimal variation between exposure groups. The small differences in medians suggest that smoke and gas exposure has little effect on FEV. The R-squared value of 0.0019 indicates that exposure explains less than 0.2% of the variation in FEV, suggesting other factors are more influential.
However, it’s essential to note that given we are conducting an EDA, we are to prioritize understanding associations and correlations rather than establishing statistical significance at this stage of the analysis.
Association Between PM2.5 Exposure and FEV
# calculate the correlation between PM2.5 exposure and FEVcor_pm25_fev <-cor(merged_data$pm25_mass, merged_data$fev, use ="complete.obs")print(paste("Correlation between PM2.5 and FEV:", cor_pm25_fev))
[1] "Correlation between PM2.5 and FEV: -0.0734131664601442"
# linear regression modelmodel_pm25_fev <-lm(fev ~ pm25_mass, data = merged_data)summary(model_pm25_fev)
Call:
lm(formula = fev ~ pm25_mass, data = merged_data)
Residuals:
Min 1Q Median 3Q Max
-1070.63 -206.57 -13.33 195.87 1277.65
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2073.452 19.309 107.383 <2e-16 ***
pm25_mass -3.016 1.184 -2.548 0.011 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 317.3 on 1198 degrees of freedom
Multiple R-squared: 0.005389, Adjusted R-squared: 0.004559
F-statistic: 6.492 on 1 and 1198 DF, p-value: 0.01096
# scatter plot to visualize the relationshipggplot(merged_data, aes(x = pm25_mass, y = fev)) +geom_point(alpha =0.5, color ="pink") +geom_smooth(method ="lm", color ="coral", se =FALSE) +labs(title ="Scatter Plot of PM2.5 Exposure vs Forced Expiratory Volume (ml)",x ="PM2.5 Exposure (µg/m³)",y ="Forced Expiratory Volume (ml)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
The scatter plot illustrates the relationship between PM2.5 exposure and FEV. The fitted line, which appears nearly horizontal, indicates a weak relationship between PM2.5 exposure and FEV. The correlation coefficient of approximately -0.073 suggests a very weak negative linear association, implying that changes in PM2.5 exposure are not meaningfully associated with changes in FEV.
As was the case for the previous plots, we are conducting an EDA so we choose to prioritize understanding associations and correlations rather than establishing statistical significance at this stage of the analysis.
Visualization
Facet plot showing scatterplots with regression lines of BMI vs FEV by “townname”.
# calculate correlation coefficients by towncorrelation_values <- merged_data %>%group_by(townname) %>%summarise(correlation =cor(fev, bmi, use ="complete.obs") )# print correlation valuesprint(correlation_values)
# A tibble: 12 × 2
townname correlation
<chr> <dbl>
1 Alpine 0.184
2 Atascadero 0.383
3 Lake Elsinore 0.545
4 Lake Gregory 0.349
5 Lancaster 0.383
6 Lompoc 0.331
7 Long Beach 0.410
8 Mira Loma 0.371
9 Riverside 0.257
10 San Dimas 0.444
11 Santa Maria 0.380
12 Upland 0.385
# plot BMI vs. FEV by townmerged_data %>%ggplot(mapping =aes(x = bmi, y = fev, color = townname)) +geom_point(alpha =0.5) +geom_smooth(method = lm, color ="black") +facet_wrap(~ townname) +xlab("Body Mass Index") +ylab("Forced Expiratory Volume (ml)") +ggtitle("Scatterplots of BMI vs Forced Expiratory Volume (ml) by Town")
`geom_smooth()` using formula = 'y ~ x'
The facet plot visualizes the relationship between Body Mass Index (BMI) and Forced Expiratory Volume (FEV) across twelve towns, with each scatterplot showing a positive correlation between the two variables. The strength of the association varies by town, with Lake Elsinore (0.5446) showing the strongest positive correlation, indicating that higher BMI is more strongly associated with increased FEV in this town. Other towns like San Dimas (0.4443) and Long Beach (0.4104) also display moderate correlations. Weaker correlations are seen in Riverside (0.2571) and Alpine (0.1845), where BMI has a smaller effect on FEV. Overall, the data suggests a positive, though modest, relationship between BMI and lung function, with some variability across towns.
Stacked histograms of FEV by BMI category and FEV by smoke/gas exposure. Use different color schemes than the ggplot default.
stacked histogram for FEV by BMI category
# stacked histogram for FEV by BMI categoryggplot(merged_data, aes(x = fev, fill = obesity_level)) +geom_histogram(position ="stack", bins =30, alpha =0.7) +scale_fill_manual(values =c("lightblue", "lightgreen", "lightcoral", "lightpink")) +labs(title ="Stacked Histogram of Forced Expiratory Volume (ml) by BMI Category",x ="Forced Expiratory Volume (ml)",y ="Count") +guides(fill =guide_legend(title ="Obesity Category"))
In the stacked histogram of FEV by BMI category, we observe that the distribution of FEV shifts based on obesity levels. Only the “normal” obesity category shows a relatively normal distribution, the “obese” group slightly normal but centered at a different value in comparison. The “normal” BMI group is centered around an FEV of 1800-2100 ml, while the “underweight” group has most of its FEV values between 1500-2000 ml. However, the “underweight” category has the fewest observations, which could affect its distribution in this dataset as it doesn’t appear normal. Both the “obese” and “overweight” groups have their FEV distributions centered around 2000-2500 ml. Additionally, we notice a potential outlier in the “obese” group with an FEV above 3250 ml, which contrasts with the group’s typical FEV range. Overall, The stacked histogram demonstrates that while the centers of the distributions for each obesity group are distinct, there is still considerable overlap in the FEV ranges across groups. For example, individuals in the “normal,”“overweight,” and “obese” categories can all have FEVs around 2000 ml.
Stacked histogram for FEV by smoke/gas exposure
# stacked histogram for FEV by smoke/gas exposureggplot(merged_data, aes(x = fev, fill = smoke_gas_exposure)) +geom_histogram(position ="stack", bins =30, alpha =0.7) +scale_fill_manual(values =c("lightblue", "lightgreen", "lightcoral", "lightpink")) +labs(title ="Stacked Histogram of Forced Expiratory Volume (ml) by Smoke/Gas Exposure",x ="Forced Expiratory Volume (ml)",y ="Count") +guides(fill =guide_legend(title ="Smoke/Gas Exposure"))
This histogram shows FEV distribution across four categories of smoke and gas exposure: both exposures, gas stove exposure only, no exposure, and smoke exposure only. The largest counts fall within the “gas stove exposure only” category, which dominates the central portion of the FEV range, from 1500 to 3000 ml. The “no exposure” category is more evenly spread, while “both exposures” and “smoke exposure only” have smaller counts overall. Although these distributions appear normal, there is little distinction in where they are centered based on smoke or gas exposure. This suggests there isn’t compelling evidence from this dataset that smoke or gas exposure significantly influences FEV, as the differences between the exposure groups are not as pronounced as those seen in the BMI categories.
Barchart of BMI by smoke/gas exposure.
merged_data %>%ggplot(aes(x = obesity_level, fill = smoke_gas_exposure)) +geom_bar(position ="dodge", alpha =0.5) +scale_fill_manual(values =c("lightblue", "lightgreen", "lightcoral", "lightpink")) +labs(title ="Bar Chart of BMI by Smoke and Gas Exposure",x ="BMI Categories",y ="Count",fill ="Smoke and Gas Exposure" ) +theme_minimal()
The bar chart illustrates the distribution of BMI categories (normal, obese, overweight, underweight) based on exposure to smoke and gas from stoves. The data reveals that individuals with normal BMI are predominantly exposed to gas stove only, with a notable portion having no exposure. In contrast, the number of individuals exposed to both smoke and gas or smoke alone is significantly lower across all BMI categories. Among those classified as obese or overweight, the trend remains consistent, with gas stove exposure being the most common factor, followed by no exposure. In the underweight category, there is a smaller population, primarily exposed to gas stove only as is the case across te other BMI categories. From this plot, there is no evidence that obesity level varies by smoke or gas exposure, as the distribution of smoke and gas exposures seems consistent across obesity groups.
Statistical summary graphs of FEV by BMI and FEV by smoke/gas exposure category. ## Statistical summary graph of FEV by BMI
# calculate mean and standard deviation of FEV by BMIbmi_summary <- merged_data %>%group_by(obesity_level) %>%summarise(mean_fev =mean(fev, na.rm =TRUE)) %>%ungroup()# scatter plot with summaryggplot(data = merged_data, aes(x = obesity_level, y = fev, fill = obesity_level)) +geom_boxplot(alpha =0.5, outlier.color ="black") +labs(title ="Boxplot of Forced Expiratory Volume (ml) by Obesity Level",x ="Obesity Level",y ="Forced Expiratory Volume (ml)") +guides(fill =guide_legend(title ="Obesity Category")) +scale_fill_brewer(palette ="Pastel2") +# Added missing `+`geom_text(data = bmi_summary, aes(x = obesity_level, y = mean_fev, label =round(mean_fev, 1)),vjust =-0.75, color ="black", size =3) +theme_minimal(base_size =15)
This boxplot shows that individuals classified as “obese’ have the highest median FEV (2266.2 ml), followed by those who are “overweight” (2224.3 ml). People with normal BMI have a lower median FEV (1999.8 ml), while the “underweight” group has the lowest median value (1698.3 ml). The data also reveal that “normal”-weight and “obese”“ individuals exhibit greater variability in FEV, as evidenced by the presence of more outliers in these groups.
Statistical summary graph of FEV by smoke/gas exposure category
# calculate mean and standard deviation of FEV by smoke/gas exposure categorysmokegas_summary <- merged_data %>%group_by(smoke_gas_exposure) %>%summarise(mean_fev =mean(fev, na.rm =TRUE))# boxplot of FEV by smoke/gas exposure categoryggplot(data = merged_data, aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) +geom_boxplot(alpha =0.5, outlier.color ="black") +labs(title ="Boxplot of Forced Expiratory Volume (ml) by Smoke and Gas Exposure Category",x ="Smoke and Gas Exposure",y ="Forced Expiratory Volume (ml)") +guides(fill =guide_legend(title ="Smoke/Gas Exposure Category")) +# Corrected legend titlescale_fill_brewer(palette ="Set3") +geom_text(data = smokegas_summary, aes(x = smoke_gas_exposure, y = mean_fev, label =round(mean_fev, 1)),vjust =-0.75, color ="black", size =3) +theme_minimal(base_size =10)
In contrast to the previous boxplot, there is minimal difference in the median FEV across the smoke and gas exposure categories. Those with no exposure (2056.7 ml) and smoke exposure only (2055.7 ml) have slightly higher FEV values compared to individuals with both exposures (2024.8 ml) or gas stove exposure only (2022.7 ml). The overall spread of FEV is similar across all exposure groups, though both the no exposure and both exposures categories show a higher number of outliers, indicating that other factors may contribute to variability in lung function.
In summary, BMI appears to have a more noticeable impact on FEV, with obese and overweight individuals displaying higher lung function. On the other hand, differences in smoke and gas exposure seem to have a smaller influence on FEV given overlapping data, though some variability exists across the exposure categories.
A leaflet map showing the concentrations of PM2.5 mass in each of the CHS communities.
# create a color palette based on PM2.5 mass concentrationpal <-colorNumeric(palette ="viridis", domain = merged_data$pm25_mass)# create the leaflet mapleaflet(merged_data) %>%addTiles() %>%addCircleMarkers(~lon, ~lat, radius =5, color =~pal(pm25_mass), stroke =FALSE, fillOpacity =0.6, popup =~paste("PM2.5 Mass Concentration: ", round(pm25_mass, 2)) ) %>%addLegend("bottomright", pal = pal, values =~pm25_mass,title ="PM2.5 Mass Concentration",opacity =1)
Based on the map, it appears that site with the highest PM2.5 mass is located at one of the more eastern locations in California. We can retrieve this value among others.
Mira Loma exhibits the highest PM2.5 mass levels in this dataset. Based on online sources, it is recognized as one of the most polluted cities in Southern California in terms of PM2.5 pollution, largely due to its close proximity to the Ontario Freeway. This observation is corroborated by the data in this dataset.
Other cities with above-average PM2.5 levels include Long Beach, Riverside, San Dimas, and Upland. Like Mira Loma, these cities are located near major freeways, which contributes to elevated PM2.5 levels. In contrast, areas with lower PM2.5 concentrations are seen to be closer to the coast or situated further inland.
Choose a visualization to examine whether PM2.5 mass is associated with FEV.
# calculate correlation coefficientcor_pm25_fev <-cor(merged_data$pm25_mass, merged_data$fev, use ="complete.obs")print(paste("Correlation between PM2.5 Mass and FEV:", cor_pm25_fev))
[1] "Correlation between PM2.5 Mass and FEV: -0.0734131664601442"
# scatterplot with smoothing lineggplot(data = merged_data, mapping =aes(x = pm25_mass, y = fev)) +geom_point(alpha =0.5) +geom_smooth(method ="loess", col ="pink", se =FALSE) +labs(title ="Scatterplot of PM2.5 Mass vs Forced Expiratory Volume (ml)", x ="PM2.5 Mass (µg/m³)", y ="Forced Expiratory Volume (ml)") +xlim(5.96, 29.97) +annotate("text", x =10, y =max(merged_data$fev, na.rm =TRUE), label =paste("Correlation Coefficient:", round(cor_pm25_fev, 2)), color ="black", size =4)
`geom_smooth()` using formula = 'y ~ x'
A scatter plot is the ideal choice to examine the relationship between PM2.5 mass and FEV because both are continuous variables. The scatter plot reveals a slight negative relationship, although weak, between PM2.5 mass and FEV, as indicated by the downward slope of the linear regression line with a value of -0.07. Notably, the mean FEV1 for the participants exposed to <= 10 µg/m³ was the highest and those exposed to 30 µg/m³ was the lowest. This suggests that as PM2.5 mass increases, FEV tends to decrease slightly, indicating a possible adverse effect of air pollution on lung function.